Abstract. In this paper, we present the results of the first phase of a project
aimed at implementing a full suite of IPSec cryptographic transformations in
reconfigurable hardware. Full implementations of the new Advanced
Encryption Standard, Rijndael, and the older American federal standard, Triple
DES, were developed and experimentally tested using the SLAAC-1V FPGA
accelerator board, based on Xilinx Virtex 1000 devices. The experimental clock
frequencies were equal to 91 MHz for Triple DES, and 52 MHz for Rijndael.
This translates to throughputs of 116 Mbit/s for Triple DES, and 577, 488,
and 423 Mbit/s for Rijndael with 128-, 192-, and 256-bit keys respectively. We
also demonstrate a capability to enhance our circuit to handle the encryption
and decryption throughputs of over 1 Gbit/s regardless of the chosen algorithm.
Our estimates show that this gigabit-rate, double-algorithm,
encryption/decryption circuit will fit in one Virtex 1000 FPGA taking
approximately 80% of the area.



1. Introduction 

IPSec is a set of protocols for protecting communication through the Internet at the IP
(Internet Protocol) Layer [15, 22]. One of the primary applications of this protocol is
an implementation of Virtual Private Networks (VPNs). In IPSec Tunnel Mode,
multiple private local area networks are connected through the Internet as shown in
Fig. 1a. Since the Internet is an untrustworthy network, a secure tunnel must be
created between security gateways (such as firewalls or routers) belonging to private
networks involved in the communication. The information passing through the secure
tunnel is encrypted and authenticated. Additionally, the original IP header, containing
the sender's and receiver's addresses, is also encrypted, and replaced by a new header
including only information about the security gateway addresses. This way a limited
resistance against the traffic control analysis is accomplished. A second use of IPSec
is client-to-server or peer-to-peer encryption and authentication (see Fig. 1b). In
IPSec Transport Mode, many independent pair-wise encryption sessions may exist
a) Tunnel mode b) Transport mode




Fig. 1. IPSec Tunnel and Transport Modes 



simultaneously. The large number of connections and high bandwidth supported by a
single security gateway or server suggest the use of hardware accelerators for
implementing cryptographic transformations.

The suite of cryptographic algorithms used for encryption and authentication in
IPSec is constantly evolving. In the case of encryption, current implementations of
IPSec are required to support DES, and have the option of supporting Triple DES,
RC5, IDEA, Blowfish, and CAST-128. Since DES has been shown to be vulnerable
to an exhaustive key-search attack using the computational resources of a single
corporation [2], the current implementations of IPSec typically support Triple DES. In
1997, the National Institute of Standards and Technology (NIST) initiated an effort
towards developing a new encryption standard, called AES (Advanced Encryption
Standard) [1]. The development of the new standard was organized in the form of a
contest coordinated by NIST. In October 2000, Rijndael was announced as the winner
of the contest and a future Advanced Encryption Standard. In November 2000, a first
Internet-draft was issued, proposing including AES-Rijndael as a required encryption
algorithm in IPSec, with the remaining AES contest finalists as optional algorithms to
be used in selected applications [11].

An encryption algorithm is not the only part of IPSec that is currently being
extended and modified. Other modifications currently being considered include
different modes of operation for encryption algorithms [18], hash functions used by
authentication algorithms [21], type and parameters of public key cryptosystems used
by a key management protocol, etc. The fast and hard to predict evolution of IPSec
algorithms leads naturally to prototype and commercial implementations based on
reconfigurable hardware.
An FPGA implementation can be easily upgraded to incorporate any protocol
changes without the need for expensive and time-consuming physical design,
fabrication, and testing required in case of ASICs. Additional capabilities appear
when an FPGA accelerator supports a real-time partial reconfiguration. In this case,
the accelerator can reconfigure itself on the fly to adapt to:

• traffic conditions (e.g., by changing the number of packet streams processed
simultaneously),

• phase of the protocol (e.g., by using the same FPGA with time sharing for
implementing key exchange, encryption, and authentication),

• various key sizes and parameter values (e.g., by adjusting the circuit architecture to
different key sizes and different values of system parameters).
Additionally, several optional IPSec-compliant encryption, authentication, and key
exchange algorithms can be implemented, and their bitstreams stored in the cache
memory on the FPGA board. Algorithm agility accomplished this way can
substantially increase the system interoperability.

In this paper, we present the results of the first phase of our project aimed at
implementing a full suite of IPSec cryptographic transformations using the SLAAC-1V
FPGA board. In this phase, two encryption algorithms, AES-Rijndael and Triple DES,
were implemented and experimentally tested in our environment.



2. FPGA Board

The SLAAC-1V PCI board is an FPGA-based computation accelerator developed
under a DARPA-funded project called Systems-Level Applications of Adaptive
Computing (SLAAC). This project, led by USC Information Sciences Institute (ISI),
investigated the use of adaptive computing platforms for open, scalable,
heterogeneous cluster-based computing on high-speed networks. Under the SLAAC
project, ISI developed several FPGA-based computing platforms and a high-level
distributed programming model for FPGA-accelerated cluster computing [16]. About
a dozen universities and research labs are using SLAAC-1V for a variety of signal
and image processing applications.

The SLAAC-1V board architecture is based on three user-programmable Xilinx
Virtex XCV-1000-6 FPGA devices. Each of these devices is composed of 12,288
basic logic cells referred to as CLB (Configurable Logic Block) slices, and includes
32 4-kbit blocks of synchronous, dual-ported RAM. All devices can achieve
synchronous system clock rates up to 200 MHz, including input/output interface.

The logical architecture of SLAAC-1V is shown in Fig. 2. The three Virtex 1000
FPGAs (denoted as X0, X1, and X2) are the primary processing elements. They are
connected by a 72-bit "ring" path as well as a 72-bit shared bus. The width of both
buses supports an 8-bit control tag associated with each 64-bit data word. The
direction of each line of both buses can be controlled independently. The processing
elements are connected to ten 256K x 36-bit SRAMs (Static Random Access
Memories) located on mezzanine cards. The FPGAs X1 and X2 are each connected to
four SRAMs, while X0 is connected to two. The memory cards have bus
switches that allow the host to directly access all memories through X0.

About 20% of the resources in the X0 FPGA are devoted to the PCI interface and
board control module. The remaining logic of this device (as well as the entire X1 and
X2 FPGAs) can be used by the application developer. The 32/33 control module
release uses the Xilinx 32-bit 33 MHz PCI core. The control module provides
high-speed DMA (Direct Memory Access), data buffering, clock control (including
single-stepping and frequency synthesis from 1 to 200 MHz), user-programmable interrupts,
etc. The current 32/33 control module has obtained DMA transfer rates of over 1
Gbit/s (125 MB/s) from the host memory, very near the PCI theoretical maximum.
The bandwidth for SLAAC-1V using the 64-bit 66 MHz PCI controller (using the
Xilinx 64-bit 66 MHz core) has been measured at 2.2 Gbit/s. The user's design located
in X0 is connected to the PCI core via two 256-deep, 64-bit wide FIFOs. The DMA
Fig. 2. SLAAC-1V Architecture



controller located in the interface part of X0 can transfer data to or from these FIFOs
as well as provide fast communication between the host and the board SRAMs. The
DMA controller load balances input and output FIFOs, and can process large memory
buffers without host processor interaction. Current interface development includes
managing memory buffer rings on the FPGA to minimize host interrupts on small
buffers.

SLAAC-1V supports partial reconfiguration, in which part of an FPGA is
reconfigured while the rest of the FPGA remains active and continues to compute. A
small dedicated Virtex 300 configuration control device is used to configure all
FPGAs and manages 6 MB of flash / SRAM as a configuration "cache".

The work discussed in this paper was done in collaboration with the ISI Gigabit-Rate
IPSec (GRIP) project, which is funded in the DARPA Next Generation Internet
(NGI) program. The GRIP team has constructed a gigabit Ethernet daughter card
which connects to SLAAC-1V in place of the crossbar connection of the X0 chip. To
the host, the SLAAC-1V / GRIP system appears to be a gigabit Ethernet card with
optional acceleration features. The GRIP team is currently customizing the TCP/IP
stack for the Linux operating system to take advantage of the hardware acceleration in
order to deliver fully-secure, fully-authenticated gigabit-rate traffic to the desktop.



3. Implementation of Rijndael 



Rijndael is a symmetric key block cipher with a variable key size and a variable
input/output block size. Our implementation supports all three key sizes required by
the draft version of the AES standard: 128, 192, and 256 bits. Our key scheduling unit
is referred to as 3-in-1, which means that it can process all three key sizes. Switching
from one key size to the other is instantaneous, and is triggered by the appropriate
control signals. Our implementation is limited to the block size of 128 bits, which is
the only block size required by the Advanced Encryption Standard. Implementing other
block sizes, specified in the original, non-standardized description of Rijndael, is not
justified from the economical point of view, as it would substantially increase circuit
area and cost without any substantial gain in the cipher security.
Rijndael is a substitution-linear transformation cipher based on S-boxes and
operations in the Galois Fields. Below, we describe the way of implementing all
component operations of Rijndael, and then present how these basic operations are
combined together to form the entire encryption/decryption unit.



3.1 Component Operations

Implementation of the encryption round of Rijndael requires realization of four
component operations: ByteSub, ShiftRow, MixColumn, and AddRoundKey.
Implementation of the decryption round of Rijndael requires four inverse operations:
InvByteSub, InvShiftRow, InvMixColumn, and AddRoundKey.

ByteSub is composed of sixteen identical 8x8 S-boxes working in parallel.
InvByteSub is composed of the same number of 8x8-bit inverse S-boxes. Each of
these S-boxes can be implemented independently using a 256 x 8-bit look-up table.

A Virtex XCV-1000 device contains 32 4-kbit Block SelectRAMs. Each of these
memory blocks is a synchronous, dual-ported RAM with the data port width
configurable to an arbitrary power of two in the range from 1 to 16. Each memory
block can be used to realize two table look-ups per clock cycle, one for each data port.

In particular, each 4-kbit Block SelectRAM can be configured as a 512 x 8-bit
dual-port memory. If encryption or decryption are implemented separately, only the
first 256 bytes of each memory block are utilized as a look-up table. If encryption and
decryption are implemented together within the same FPGA, both uninverted and
inverted 256-byte look-up tables are stored in one memory block. In each case,
16 data bits are processed by one memory block, which means that a total of 8
memory blocks are needed to process the entire 128-bit input.
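The table organisation described above can be modelled in software. The sketch below builds the 512 x 8-bit contents of one memory block, with the forward S-box in the first 256 bytes and the inverse S-box in the second 256 bytes. Using the ninth address bit as an encrypt/decrypt select is an assumption made for illustration, and the names `block_ram` and `lookup` are hypothetical. The S-box values are generated from the Rijndael definition (inversion in GF(2^8) followed by an affine transformation) rather than copied from a table.

```python
def gf_mul(a, b):
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def sbox_entry(a):
    """Rijndael S-box: multiplicative inverse (0 maps to 0), then affine map."""
    inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


forward = [sbox_entry(a) for a in range(256)]
inverse = [0] * 256
for a, s in enumerate(forward):
    inverse[s] = a

# Assumed layout of one 512 x 8-bit block: address bit 8 selects the table.
block_ram = forward + inverse


def lookup(byte, decrypt=False):
    """One table look-up, as performed through one port of the memory block."""
    return block_ram[(256 if decrypt else 0) + byte]


# Two ports per block means two bytes per block per clock cycle,
# so sixteen bytes (128 bits) need eight blocks.
blocks_needed = (128 // 8) // 2
```

Keeping both tables in one block costs no extra memory, because a 256-byte table fills only half of a 4-kbit block.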

ShiftRow and InvShiftRow change the order of bytes within a 16-byte (128-bit)
word. Both transformations involve only changing the order of signals, and therefore
they can be implemented using routing only, and do not require any logic resources,
such as CLBs or dedicated RAM.
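Because ShiftRow only reorders bytes, it can be written down as a fixed wiring list. The sketch below assumes the usual Rijndael byte numbering (column by column, index = row + 4 * column), which the text does not spell out; the function names are illustrative only.

```python
def shift_row_wiring():
    """Return src such that output byte i is wired to input byte src[i].

    Row r of the 4 x 4 state is rotated left by r positions.
    """
    return [r + 4 * ((c + r) % 4) for c in range(4) for r in range(4)]


def inv_shift_row_wiring():
    """Wiring of the inverse transformation (row r rotated right by r)."""
    return [r + 4 * ((c - r) % 4) for c in range(4) for r in range(4)]


def permute(state, wiring):
    """Apply a wiring list to a 16-byte state; no arithmetic is involved."""
    return bytes(state[src] for src in wiring)


SHIFT = shift_row_wiring()
INV_SHIFT = inv_shift_row_wiring()
```

In hardware the two lists correspond to two fixed sets of wires, selected by the multiplexer that chooses between encryption and decryption.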

The MixColumn transformation can be expressed as a matrix multiplication in
the Galois Field GF(2^8):

$$
\begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ B_3 \end{bmatrix} =
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \end{bmatrix}
\qquad (1)
$$

where each symbol represents an 8-bit element of
the Galois Field. Each of these elements can be treated as a polynomial of degree
seven or less, with coefficients in {0,1} determined by the respective bits of the
GF(2^8) element. For example, '03' is equivalent to '0000 0011' in binary, and to

$$
C(x) = 0 \cdot x^7 + 0 \cdot x^6 + 0 \cdot x^5 + 0 \cdot x^4 + 0 \cdot x^3 + 0 \cdot x^2 + 1 \cdot x + 1 = x + 1 \qquad (2)
$$

in the polynomial basis representation.

The multiplication of elements of GF(2^8) is accomplished by multiplying the
corresponding polynomials modulo a fixed irreducible polynomial

$$
m(x) = x^8 + x^4 + x^3 + x + 1. \qquad (3)
$$
For example, multiplying a variable element A = a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0 by a constant
element '03' is equivalent to computing

$$
B(x) = b_7 x^7 + b_6 x^6 + b_5 x^5 + b_4 x^4 + b_3 x^3 + b_2 x^2 + b_1 x + b_0 =
(a_7 x^7 + a_6 x^6 + a_5 x^5 + a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0)(x + 1)
\bmod (x^8 + x^4 + x^3 + x + 1). \qquad (4)
$$

After several simple transformations

$$
B(x) = (a_7 + a_6) x^7 + (a_6 + a_5) x^6 + (a_5 + a_4) x^5 + (a_4 + a_3 + a_7) x^4 + (a_3 + a_2 + a_7) x^3
+ (a_2 + a_1) x^2 + (a_1 + a_0 + a_7) x + (a_0 + a_7), \qquad (5)
$$

where '+' represents an addition modulo 2, i.e. an XOR operation.

Each bit of a product B can be represented as an XOR function of at most three
variable input bits, e.g., b_7 = (a_7 + a_6), b_4 = (a_4 + a_3 + a_7), etc.

Each byte of the result of a matrix multiplication (1) is an XOR of four bytes
representing the Galois Field product of a byte A_0, A_1, A_2, or A_3 by a respective
constant. As a result, the entire MixColumn transformation can be performed using
two layers of XOR gates, with up to 3-input gates in the first layer, and 4-input gates
in the second layer. In Virtex FPGAs, each of these XOR operations requires only one
lookup table (i.e., a half of a CLB slice).
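The two-layer XOR structure can be checked with a short software model. The sketch below implements the multiplication by '03' purely with the bit-level XORs of formula (5), compares it against a generic GF(2^8) multiplier, and then forms one MixColumn output column as in formula (1). The function names are illustrative and not taken from our design.

```python
def gf_mul(a, b):
    """Generic GF(2^8) multiplication modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def mul03_xor_network(a):
    """Multiply by '03' using only the bit equations of formula (5)."""
    bit = [(a >> i) & 1 for i in range(8)]      # bit[i] is a_i
    b = [
        bit[0] ^ bit[7],                        # b0
        bit[1] ^ bit[0] ^ bit[7],               # b1
        bit[2] ^ bit[1],                        # b2
        bit[3] ^ bit[2] ^ bit[7],               # b3
        bit[4] ^ bit[3] ^ bit[7],               # b4
        bit[5] ^ bit[4],                        # b5
        bit[6] ^ bit[5],                        # b6
        bit[7] ^ bit[6],                        # b7
    ]
    return sum(v << i for i, v in enumerate(b))


MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]


def mix_column(col):
    """First layer: constant multiplications; second layer: 4-input XOR."""
    out = []
    for row in MIX:
        products = [gf_mul(c, a) for c, a in zip(row, col)]
        out.append(products[0] ^ products[1] ^ products[2] ^ products[3])
    return out
```

No output bit of the '03' network depends on more than three input bits, which is the reason a single look-up table per bit suffices in the first layer.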

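The InvMixColumn transformation can be expressed as the following matrix
multiplication in GF(2^8):

$$
\begin{bmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \end{bmatrix} =
\begin{bmatrix}
0E & 0B & 0D & 09 \\
09 & 0E & 0B & 0D \\
0D & 09 & 0E & 0B \\
0B & 0D & 09 & 0E
\end{bmatrix}
\begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ B_3 \end{bmatrix}
\qquad (6)
$$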



The primary differences compared to MixColumn are the larger hexadecimal
values of the matrix coefficients. Multiplication by these constant elements of the
Galois Field leads to the more complex dependence between the bits of a variable
input and the bits of a respective product. For example, the multiplication A = '0E' · B
leads to the following dependence between the bits of A and B:

$$
\begin{aligned}
a_7 &= b_6 + b_5 + b_4 && (7) \\
a_6 &= b_7 + b_5 + b_4 + b_3 && (8) \\
a_5 &= b_6 + b_4 + b_3 + b_2 && (9) \\
a_4 &= b_5 + b_3 + b_2 + b_1 && (10) \\
a_3 &= b_6 + b_5 + b_2 + b_1 + b_0 && (11) \\
a_2 &= b_6 + b_1 + b_0 && (12) \\
a_1 &= b_5 + b_0 && (13) \\
a_0 &= b_7 + b_6 + b_5 && (14)
\end{aligned}
$$

The entire InvMixColumn transformation can be performed using two layers of XOR
gates, with up to 6-input gates in the first layer, and 4-input gates in the second layer.
Because of the use of gates with the larger number of inputs, the InvMixColumn
transformation has a longer critical path compared to the MixColumn transformation,
and the entire decryption is more time consuming than encryption.
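The gate sizes quoted above can be derived mechanically. The sketch below is a software check, not part of our circuit: it lists, for a given constant, which input bits are XORed into each output bit (exploiting the linearity of multiplication by a constant), and reports the largest first-layer gate needed for the MixColumn and InvMixColumn matrices. It also confirms that the two matrices are inverses of each other. All names are illustrative.

```python
def gf_mul(a, b):
    """GF(2^8) multiplication modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


MIX = [[0x02, 0x03, 0x01, 0x01], [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03], [0x03, 0x01, 0x01, 0x02]]
INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09], [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B], [0x0B, 0x0D, 0x09, 0x0E]]


def apply_matrix(matrix, col):
    """Multiply a 4-byte column by a 4 x 4 matrix over GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


def xor_inputs(const):
    """Map each output bit to the input bits XORed together for `const`."""
    deps = {i: [] for i in range(8)}
    for j in range(8):
        product = gf_mul(const, 1 << j)     # contribution of input bit j
        for i in range(8):
            if (product >> i) & 1:
                deps[i].append(j)
    return deps


def max_fan_in(matrix):
    """Largest number of inputs of any first-layer XOR gate."""
    return max(len(d) for row in matrix for c in row
               for d in xor_inputs(c).values())
```

For example, `xor_inputs(0x0E)[7]` returns the indices of the bits of B that feed bit a_7 of the product '0E' · B.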

AddRoundKey is a bitwise XOR of two 128-bit words and can be implemented
using one layer of 128 look-up tables, which translates to 64 CLB slices. Assuming
that one operand of the bitwise XOR is fixed, this operation is an inverse of itself, so
no special transformation is required for decryption.
Fig. 3. The general architecture of Rijndael



3.2 General Architecture of the Encryption/Decryption Unit 

The block diagrams of the encryption/decryption unit in the basic iterative
architecture and in the extended pipelined architecture are shown in Fig. 3. Only
registers R1, R3, R4, and R5 (shaded rectangles in Fig. 3) are present in the basic
iterative architecture. The remaining registers (transparent rectangles in Fig. 3) have
been added in the extended architecture based on the concept of inner-round
pipelining.

The register R1 is a part of Block SelectRAM, the synchronous dedicated
memory, used to implement ByteSub and InvByteSub transformations, so it was
chosen as a basic register in the basic iterative architecture. In this architecture, 11,
13, and 15 clock cycles are required in order to process one block of data for 128-,
192-, and 256-bit keys respectively. The critical path is located in the decryption
circuit, and includes AddRoundKey (an xor operation), InvMixColumn,
InvShiftRow, multiplexer, and InvByteSub (memory read). It is important to note that
our decryption circuit has a structure (order of operations) similar to the encryption
circuit, but still does not require any additional processing of round keys (unlike the
architecture suggested in [5] and adopted in 18; 0]).

Introducing pipeline registers R2a-c and R0 allows the circuit to process two
independent streams of data at the same time. Our architecture assumes the use of the
Cipher Block Chaining (CBC) mode for processing long streams of data. The CBC
mode is the only mode required by the current specification of IPSec to be used with
DES and all optional IPSec encryption algorithms. It is also most likely to be the first
mode recommended for use together with AES. The encryption and decryption
Fig. 4. Cipher Block Chaining Mode a) encryption, b) decryption

in the CBC mode are shown in Fig. 4. An initialization vector IV is different for each
packet and is transmitted in clear as a part of the packet header. The CBC mode
allows concurrent encryption of blocks belonging to different packets, but not to the
same packet. This limitation comes from the fact that the encryption of any block of
data cannot begin before the ciphertext of the previous block becomes available (see
Fig. 4a). The same limitation does not apply to decryption, where all blocks can be
processed in parallel.

In our implementation, the memory buffers M1, M2, and M3 are used to store the
last (i.e., the most recently processed) ciphertext blocks for up to 16 independent
streams of data. Before the processing of the given stream begins, the corresponding
memory location is set to the initialization vector used during the encryption or
decryption of the first block of data.

Our architecture allows the simultaneous encryption of two blocks belonging to
two different packets, and the simultaneous decryption of two blocks belonging to the
same packet or two different packets.
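The role of the per-stream chaining memory can be illustrated with a small behavioural model. In the sketch below, `toy_encrypt` and `toy_decrypt` are deliberately trivial, insecure stand-ins for the block cipher, and the class `CbcUnit` with its 16-entry list is only an analogy for the memory buffers described above, not a description of the actual circuit. The model shows that encryption must alternate between packets, while any block can be decrypted as soon as it and its predecessor are known.

```python
MASK = (1 << 128) - 1


def toy_encrypt(block, key):
    """Stand-in for the block cipher: invertible but NOT secure."""
    x = block ^ key
    return ((x << 7) | (x >> 121)) & MASK      # rotate left by 7 bits


def toy_decrypt(block, key):
    x = ((block >> 7) | (block << 121)) & MASK  # rotate right by 7 bits
    return x ^ key


class CbcUnit:
    """Keeps the most recent ciphertext block of each stream (up to 16)."""

    def __init__(self, streams=16):
        self.last = [0] * streams

    def start(self, stream, iv):
        self.last[stream] = iv                  # slot is preset to the IV

    def encrypt(self, stream, block, key):
        c = toy_encrypt(block ^ self.last[stream], key)
        self.last[stream] = c
        return c

    def decrypt(self, stream, block, key):
        p = toy_decrypt(block, key) ^ self.last[stream]
        self.last[stream] = block
        return p


def encrypt_interleaved(unit, packets, keys, ivs):
    """Alternate between packets so consecutive operations are independent."""
    for s, iv in enumerate(ivs):
        unit.start(s, iv)
    out = [[] for _ in packets]
    for i in range(max(len(p) for p in packets)):
        for s, packet in enumerate(packets):
            if i < len(packet):
                out[s].append(unit.encrypt(s, packet[i], keys[s]))
    return out


def decrypt_block(cipher, i, key, iv):
    """Block i needs only ciphertext blocks i and i-1, so any order works."""
    prev = cipher[i - 1] if i else iv
    return toy_decrypt(cipher[i], key) ^ prev
```

With two packets in flight, each encryption finds its chaining value already stored, which is what lets a two-stage pipeline stay full in CBC mode.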

The new secret-key block cipher modes, currently under investigation by NIST,
are likely to allow unlimited parallel encryption and decryption of blocks belonging to
the same packet [18]. An example of such a mode, likely to be adopted by NIST in
the near future, is a counter mode. Our implementation will be extended to
permit such new modes as soon as they become adopted as draft standards.

Our architecture can be extended by adding additional outer-round pipeline
stages, or implementing multiple instantiations of the same encryption/decryption
unit, and using them for parallel processing of data. The total throughput in these
extended architectures is directly proportional to the amount of resources (CLB slices,
dedicated RAMs) devoted to the cryptographic transformations.



3.3 Round Key Module

The round key module consists of the 3-in-1 key scheduling unit and 16 banks of
round keys. The banks of round keys are implemented using 8 Block SelectRAMs
configured as two memories 256 x 64 bits. These memories permit storing up to 16
different sets of round keys, with 16 consecutive memory locations reserved for each
set. Each set of subkeys may correspond to a different main key and a different
security association.
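The memory budget of the round key banks can be verified with simple arithmetic. In the sketch below, the address mapping (bank number in the upper four address bits, round index in the lower four) is an assumption made for illustration; the text only states that 16 consecutive locations are reserved for each set.

```python
BLOCK_RAM_BITS = 4096        # one 4-kbit Block SelectRAM
NUM_BLOCK_RAMS = 8
KEY_SETS = 16
LOCATIONS_PER_SET = 16
ROUND_KEY_BITS = 128         # two 64-bit memories side by side


def round_key_address(key_set, round_index):
    """Assumed mapping: upper 4 bits select the set, lower 4 the round."""
    if not (0 <= key_set < KEY_SETS and 0 <= round_index < LOCATIONS_PER_SET):
        raise ValueError("key set or round index out of range")
    return key_set * LOCATIONS_PER_SET + round_index


def round_keys_needed(key_bits):
    """Number of 128-bit round keys: rounds + 1, with rounds = Nk + 6."""
    return key_bits // 32 + 6 + 1


capacity_bits = NUM_BLOCK_RAMS * BLOCK_RAM_BITS
required_bits = KEY_SETS * LOCATIONS_PER_SET * ROUND_KEY_BITS
```

Even the 256-bit key, which needs the most round keys, fits within the 16 locations of one set, so one location per set remains unused in the worst case.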

The 3-in-1 key scheduling unit of Rijndael is shown in Fig. 5a. The operation of
the circuit is described by formulas given in Fig. 5b. The unit is capable of computing
two 32-bit words of the key material (w_i and w_{i+1}) per one clock cycle, independently
[Figure 5a: circuit diagram of the key scheduling unit, with a 64-bit input.]

b)

$$
w_i =
\begin{cases}
w_{i-N_k} \oplus \mathrm{Sub}(\mathrm{Rot}(w_{i-1})) \oplus \mathrm{Rcon}[i/N_k] & \text{for } i \bmod N_k = 0 \\
w_{i-N_k} \oplus \mathrm{Sub}(w_{i-1}) & \text{for } N_k = 8 \text{ and } i \bmod N_k = 4 \\
w_{i-N_k} \oplus w_{i-1} & \text{otherwise}
\end{cases}
$$

Fig. 5. The 3-in-1 key scheduling unit of Rijndael: a) the main circuit, b) formulas describing
the operation of the circuit

of the size of the main key. Since each round key is 128 bits long (the size of the input
block), two clock cycles are required to calculate each round key. Therefore, our key
scheduling unit is not designed for computing subkeys on the fly. Instead, all round
keys corresponding to the new main key are computed in advance and stored in one of
the memory banks. This computation can be performed in parallel with encrypting
data using the previous main key, therefore key scheduling does not impose any
performance penalty.
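A software reference for the key schedule is useful when checking the contents of the round key banks. The sketch below follows the standard Rijndael key expansion for all three key sizes; it is a functional model only and says nothing about how our circuit is structured internally. The helper `schedule_cycles` is an assumption of this sketch: it simply divides the total number of 32-bit words by the two words produced per clock cycle, counting the words copied directly from the main key.

```python
def gf_mul(a, b):
    """GF(2^8) multiplication modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def _build_sbox():
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    table = []
    for a in range(256):
        inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
        table.append(inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3)
                     ^ rot(inv, 4) ^ 0x63)
    return table


SBOX = _build_sbox()


def sub_word(w):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def expand_key(key):
    """Return the list of 32-bit key words w[0 .. 4 * (rounds + 1) - 1]."""
    nk = len(key) // 4                  # 4, 6, or 8 words in the main key
    rounds = nk + 6
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 0x01
    for i in range(nk, 4 * (rounds + 1)):
        t = w[i - 1]
        if i % nk == 0:
            t = sub_word(rot_word(t)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 0x02)
        elif nk == 8 and i % nk == 4:
            t = sub_word(t)
        w.append(w[i - nk] ^ t)
    return w


def schedule_cycles(key_bits):
    """Upper bound on clock cycles at two 32-bit words per cycle (assumed)."""
    words = 4 * (key_bits // 32 + 6 + 1)
    return words // 2
```

A typical use is to call `expand_key` on a 16-, 24-, or 32-byte key and group the returned words four at a time to obtain the 128-bit round keys.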



4. Implementation of Triple DES



4.1 Basic Architecture

In order to realize the Triple DES encryption and decryption it is sufficient to
implement only one round of DES, as shown in Fig. 6a. The multiplexers mux1 and
mux2 permit loading a new data block or feeding back the result of the previous iteration.
Only half of the data block is transformed in each iteration, and this transformation
depends on a round key coming from the key module. The DES-specific
transformation function F has been implemented as a combinational logic and directly
follows the algorithm specification. The multiplexers mux3 and mux4 choose the right
feedback for consecutive iterations. In the single DES implementation, these
multiplexers would not be required, because the feedback is always the same.
However, this is not the case for Triple DES because of the data swapping at the end
of the last round of DES. This feature becomes important when switching between the
first and the second, and between the second and the third DES encryption in Triple
DES. Performing the Triple DES encryption or decryption of one data block in the
CBC mode requires 48 clock cycles, exactly as many as the number of the cipher
rounds.
Fig. 6. Basic iterative architecture of Triple DES a) encryption/decryption unit, b) key
scheduling unit



4.2 Round Key Module 

The DES key schedule, which serves as a basis for the Triple DES key schedule,
consists of very simple operations. Consecutive round keys are computed by rotating
two halves of the main 56-bit key by one or two positions depending on the number
of the round. The result of each next rotation goes through the Permuted Choice-2
function (PC-2), which selects 48 bits of a round key. Since DES key scheduling
requires much simpler operations than the encryption/decryption unit, it can be easily
performed on the fly. This way only three 56-bit keys need to be stored on-chip. Our
Triple DES key scheduling unit is shown in Fig. 6b.
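The reason on-the-fly key scheduling works in both directions is that the sixteen rotations add up to a full turn of each 28-bit half. The sketch below models only the rotating half-key register; the PC-1 and PC-2 permutations are omitted, and the function names are illustrative. The left-shift amounts are those of the DES standard, and the decryption schedule is derived from them.

```python
# Rotation amounts for rounds 1..16 as defined in the DES standard.
LEFT_SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]

# For decryption the register is rotated right; the first round needs no
# rotation because sixteen left rotations return the register to its start.
RIGHT_SHIFTS = [0] + LEFT_SHIFTS[:0:-1]

MASK28 = (1 << 28) - 1


def rotl28(x, n):
    return ((x << n) | (x >> (28 - n))) & MASK28


def rotr28(x, n):
    return ((x >> n) | (x << (28 - n))) & MASK28


def half_key_states(c0, decrypt=False):
    """Return the 28-bit half-key register value used in rounds 1..16."""
    c = c0
    states = []
    for n in range(16):
        if decrypt:
            c = rotr28(c, RIGHT_SHIFTS[n])
        else:
            c = rotl28(c, LEFT_SHIFTS[n])
        states.append(c)
    return states
```

The decryption sequence visits the same register values as encryption in reverse order, so the same stored 56-bit key serves both directions.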

Four banks of the key memories are placed at the input to the key scheduling
circuit. Each bank contains three DES keys used by Triple DES. The user supplies
64-bit keys to the circuit, but only 56 bits of each key are selected by the Permuted
Choice-1 function (PC-1) and stored in one of the memory banks. Each memory bank
can hold all three keys required for performing Triple DES. All memory banks are
built using dual-port memory, and can operate independently. They are organized in a
way that permits writing a new key to one of the banks, while any other bank may be
used for the round key computations. The output of the round key memory goes to
two simple circuits: one computes keys for encryption, the other for decryption.

4.3 Extended Architecture

We are currently in the process of developing an extended pipelined architecture of
Triple DES. Our goal is to obtain throughput over 1 Gbit/s. Our approach is to fully
unroll single DES and introduce pipeline registers between cipher rounds, as shown in
Fig. 7. This leads to a capability of processing up to 16 independent data streams,
which gives a throughput of around 1.5 Gbit/s. We should be able to maintain clock
frequency at the similar or even greater level, since this architecture permits
significant simplifications compared to the basic iterative architecture. Namely,
multiplexers mux3 and mux4 are no longer required in any of the stages (see Fig. 6b),
and key scheduling can be greatly simplified as shown in Fig. 7b.
Fig. 7. Extended architecture of Triple DES: a) main circuit, b) next key module; the number of
rotations m depends on the round number n, and can be equal to 0, 1, or 2



5. Testing Procedure 

Our testing procedure is composed of three groups of tests. The first group is aimed at
verifying the circuit functionality at a single clock frequency. The goal of the second
group is to determine the maximum clock frequency at which the circuit operates
correctly. Finally, the purpose of the third group is to determine the limit on the
maximum encryption and decryption throughput, taking into account the limitations
of the PCI interface.

Our first group of tests is based on the NIST Special Publication 800-20, which
defines testing procedures for Triple DES implementations in ECB, CBC, CFB and
OFB modes of operation [20]. This publication recommends two classes of tests for
verification of the circuit functionality: Known Answer Tests (KATs), and the
Monte-Carlo tests. Since the Known Answer Tests are algorithm specific, we implemented
them only for Triple DES. The Monte Carlo test is algorithm independent, so we
implemented it for both Triple DES and Rijndael. The operation of this test is shown
in Fig. 8. The test consists of 4,000,000 encryptions with keys changed every 10,000
encryptions. The ciphertext block obtained after each sequence of 10,000 encryptions
is compared with the corresponding block obtained using software implementation.
Software implementations of Triple DES and Rijndael from the publicly available
Crypto++ 4.1 library were used in our experiments.

The second group of tests was developed based on the principle similar to the
Monte-Carlo tests. One megabyte of data is sent to the board for encryption (or
decryption), the result is transferred back to the host, and downloaded again to the
board as a subsequent part of input. The procedure is repeated 1024 times, which
corresponds to encrypting/decrypting a 1 GB stream of data using CBC mode. Only




Fig. 8. Monte Carlo Test recommended by NIST in the CBC mode
the last megabyte of output is used for verification, as it depends on all previous
input and output blocks. The transfer of data is performed by the DMA unit, so it
takes place simultaneously with encryption/decryption. If the test passes, it is repeated
at the increased clock frequency. The highest clock frequency at which no single
processing error has been detected is considered the maximum clock frequency. In
our experiments, this test was repeated 10 times with consistent results in all
iterations.
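The structure of this second group of tests can be summarised in a few lines of host-side code. The sketch below is schematic: `process_chunk` and `passes_at` are hypothetical callbacks standing in for the board interface, which is not described here, and treating the ten repetitions as repetitions at each frequency is an assumption of this sketch.

```python
def feedback_stream_test(process_chunk, first_chunk, passes=1024):
    """Feed each result back as the next input; return the last output."""
    chunk = first_chunk
    for _ in range(passes):
        chunk = process_chunk(chunk)     # e.g. 1 MB encrypted on the board
    return chunk


def find_max_clock(passes_at, start_mhz, stop_mhz, step_mhz=1, repeats=10):
    """Raise the clock until the first failure; return the last good value.

    `passes_at(freq_mhz)` is expected to run the feedback test at the given
    frequency and compare the last megabyte with a software reference.
    """
    best = None
    freq = start_mhz
    while freq <= stop_mhz:
        if not all(passes_at(freq) for _ in range(repeats)):
            break
        best = freq
        freq += step_mhz
    return best
```

Because every output block depends on all earlier blocks in CBC mode, comparing only the final chunk is enough to detect a single error anywhere in the stream.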

The third group of tests is an extension of the second group. After determining
the maximum clock frequency, we measure the amount of time necessary to process 4
GB of data, taking into account the limitations imposed by the 32 bit/33 MHz PCI
interface. Since data is transmitted through the PCI interface in both directions (input
and output), the maximum encryption/decryption throughput that can be possibly
measured using this test is equal to 528 Mbit/s. This is a half of the maximum
throughput in the regular operation of the FPGA accelerator, where only input data
are transferred from the host to the accelerator card through the PCI interface, and the
output is transferred from the FPGA card to the Ethernet daughter card.
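The 528 Mbit/s limit follows directly from the bus parameters, as the short calculation below shows. The function name is illustrative, and the figure is the theoretical peak of the bus rather than a measured value.

```python
def pci_peak_mbit_per_s(width_bits, clock_mhz):
    """Theoretical peak transfer rate of a parallel bus in Mbit/s."""
    return width_bits * clock_mhz


peak = pci_peak_mbit_per_s(32, 33)      # one direction only
test_limit = peak / 2                   # input and output share the bus
```

In regular operation the output leaves through the Ethernet daughter card, so the full peak rate is available for input data.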



6. Results

The results of static timing analysis and experimental testing for Rijndael and Triple
DES are shown in Fig. 9.

For Triple DES in the basic iterative architecture, the maximum clock frequency
is equal to 72 MHz according to the static analyzer, and 91 MHz according to the
experimental testing using the SLAAC-1V board.

For Rijndael in the basic iterative architecture, the results for encryption and
decryption are different, with decryption slower than encryption by about 13% in
experimental testing. According to the timing analyzer, the maximum clock frequency
for the entire circuit is equal to 47 MHz, with the critical path determined by the
decryption circuit. In experimental testing, decryption works correctly up to 52 MHz,
and encryption up to 60 MHz. However, we do not intend to change the clock
frequency on the fly, therefore 52 MHz sets the limit for the entire circuit. The
differences between the static timing analysis and experimental testing are caused by
conservative assumptions used by the Xilinx static timing analyzer, including the
worst case parameters for voltage and temperature prorating.

In Fig. 9b, the maximum throughputs corresponding to the analyzed and
experimentally tested clock frequencies are estimated based on the equation:

Maximum_Throughput = (Block_size / #Rounds) · Maximum_Clock_Frequency. (15)
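Formula (15) can be evaluated directly for the clock frequencies quoted in this section. In the sketch below, the number of clock cycles per block (11, 13, or 15 for Rijndael and 48 for Triple DES) is used as the divisor. The unit convention is an inference on our part and is not stated explicitly: with 1 Mbit taken as 2^20 bits, the computed values fall within 1 Mbit/s of the throughputs quoted in the text.

```python
MBIT = 2 ** 20      # assumed unit convention: 1 Mbit = 2^20 bits


def max_throughput_mbit(block_bits, cycles_per_block, clock_mhz):
    """Formula (15): block size / cycles per block * clock frequency."""
    return block_bits / cycles_per_block * clock_mhz * 1e6 / MBIT


# (block size in bits, clock cycles per block, clock frequency in MHz)
estimates = {
    "Rijndael-128, static analysis": max_throughput_mbit(128, 11, 47),
    "Rijndael-128, experiment": max_throughput_mbit(128, 11, 52),
    "Rijndael-192, experiment": max_throughput_mbit(128, 13, 52),
    "Rijndael-256, experiment": max_throughput_mbit(128, 15, 52),
    "Triple DES, static analysis": max_throughput_mbit(64, 48, 72),
    "Triple DES, experiment": max_throughput_mbit(64, 48, 91),
}

for name, value in estimates.items():
    print(f"{name}: {value:.1f} Mbit/s")
```

The same function shows why two parallel streams are enough for Rijndael to pass 1 Gbit/s, whereas Triple DES needs many more.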

Using formula (15), the maximum throughput of Rijndael in the basic iterative
architecture for a 128-bit key is 521 Mbit/s based on the static timing analysis, and
577 Mbit/s based on the experimentally measured clock frequency. This result is
expected to be further improved by optimizations of placement and routing. Taking
into account our result, parallel processing of only two streams of data should be
sufficient to obtain the speed over 1 Gbit/s. As a result, one stage of additional
registers, R2a-c, was added to the basic iterative architecture in the extended


3NSDOCID: <XP 22l9874A_l_> 



i 



Proc. Information Security Conference, Malaga, Spain, October J -J, 200 J 



a) 
175 
150 
125 
100 
75 
50 
25 
0 



Maximum clock frequency |MHi] < _ r ^ 
[2^ analjw 1 



" b) Throughput [Mbitfc] 

n 1008- 




Triple DCS 
eoc ♦ dec 



Rtjndact 

enc ' 



. coc 4 der; 



„ Banc architectures 



RJjndacI ' 
jtttc* d«*', 
E.iirod«J architecture 



Triple DES Rijadart Rijndwrl RiJndaH 

/ *coe4Wc^* enc * cnc'^cix ' enc* dec - 

' Basic architectures - EilrwW architecture . 



Fig. 9. Results of the sialic trifling analysis and experimental 'testing for Rijridael'and Triple 
DES a) maxinriirnicrDc^.fDecjirency/b) corresponding throughput : t . ->- - * 

:. . ' b i..:- ' ' O • * 

architecture as shown in Fig. 3. At this moment, we have been able to obtain a 
throughput of 887 Mbit/s for this extended pipelined architecture. Nevertheless, 
further logic and routing oplimiialions are expected to improve this throughpujL over 1 
Gbit/s without the need of introducing any additional pipeline stages. 

The worst-case throughput of Triple DES in the basic iterative architecture is 91 
Mbit/s based on the static timing analysis, and 116 Mbit/s based on the 
experimentally measured maximum clock frequency, which translates to a 27% 
speed-up in experiment. Sixteen independent streams of data processed 
simultaneously should easily exceed 1 Gbit/s, leading to the extended architecture 
shown in Fig. 7. 
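
As a rough check, formula (15) can be evaluated for the experimentally measured clock frequencies. The sketch below is not part of the original design flow and rests on assumptions that this section does not state: that 1 Mbit is counted as 2**20 bits, that #Rounds is 11, 13 and 15 clock cycles for Rijndael with 128-, 192- and 256-bit keys (one more than the number of cipher rounds), and that it is 48 for Triple DES. These values were chosen because they reproduce the reported 577, 488, 423 and 116 Mbit/s. The function and constant names are hypothetical.

```python
MBIT = 2 ** 20  # assumption: 1 Mbit = 2**20 bits


def max_throughput_mbit(block_size_bits, rounds, clock_hz):
    """Formula (15): (Block_size / #Rounds) * Maximum_Clock_Frequency."""
    return block_size_bits / rounds * clock_hz / MBIT


# Assumed clock cycles per block (#Rounds in formula (15)).
RIJNDAEL_ROUNDS = {128: 11, 192: 13, 256: 15}  # cipher rounds + 1 (assumption)
TRIPLE_DES_ROUNDS = 48                          # 3 x 16 DES rounds (assumption)

# Experimentally measured clock frequencies from the text.
rijndael = {
    key_bits: max_throughput_mbit(128, rounds, 52e6)
    for key_bits, rounds in RIJNDAEL_ROUNDS.items()
}
triple_des = max_throughput_mbit(64, TRIPLE_DES_ROUNDS, 91e6)

for key_bits, mbit in rijndael.items():
    print(f"Rijndael, {key_bits}-bit key: {mbit:.0f} Mbit/s")
print(f"Triple DES: {triple_des:.0f} Mbit/s")

# In this simple model, independent streams scale the throughput linearly.
print(f"2 Rijndael streams: {2 * rijndael[128]:.0f} Mbit/s")
print(f"16 Triple DES streams: {16 * triple_des:.0f} Mbit/s")
```

The linear scaling in the last two lines is an idealization. It ignores any drop in clock frequency caused by the additional registers and routing in the extended architectures.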

The actual encryption and decryption throughputs, taking into account the 
limitations imposed by the PCI interface, were measured using the third group of tests 
described in Section 5. The actual throughputs for Triple DES were equal to 102 Mbit/s for 
encryption, and 108 Mbit/s for decryption. The experimentally measured throughput 
for Rijndael was equal to 404 Mbit/s and was the same independently of the key size, 
which means that this throughput was limited by the PCI interface. It should be noted 
that during the regular operation of the card, when no output is transferred back to the 
host memory, this throughput can be easily doubled and reach at least 808 Mbit/s. 

The total percentage of the FPGA resources used for the basic iterative 
architectures of Rijndael and Triple DES is 15% of CLB slices and 56% of 
BlockRAMs. The extended architectures of both ciphers, capable of operating over 1 
Gbit/s, will take approximately 80% of CLB slices, and 56% of Block SelectRAMs. 
Only one Virtex XCV-1000 FPGA is necessary to assure the throughput of both 
ciphers in excess of 1 Gbit/s. Using two additional Virtex devices, and more complex 
architectures, the encryption throughput in excess of 3 Gbit/s can be accomplished. 
Our 64-bit/66 MHz PCI module will support this bandwidth. 
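
The PCI figures quoted in this section can be related to the theoretical peak bandwidth of the bus, which is the bus width multiplied by the bus clock. The sketch below is a back-of-the-envelope check only. It assumes that the interface used in the tests is a 32-bit/33 MHz PCI bus, which is not stated here but is consistent with the 528 Mbit/s test limit, and it ignores arbitration and protocol overhead. The function name is hypothetical.

```python
def pci_peak_mbit(width_bits, clock_mhz):
    """Theoretical peak PCI bandwidth in Mbit/s (10**6 bit/s), no overhead."""
    return width_bits * clock_mhz


# Assumption: the tests used a 32-bit/33 MHz PCI interface.
test_bus = pci_peak_mbit(32, 33)

# In the throughput test, input and output both cross the PCI bus,
# so each direction gets at most half of the peak bandwidth.
loopback_limit = test_bus / 2

# The 64-bit/66 MHz PCI module mentioned in the text.
fast_bus = pci_peak_mbit(64, 66)

print(f"32-bit/33 MHz peak: {test_bus} Mbit/s")
print(f"limit with output returned to host: {loopback_limit} Mbit/s")
print(f"64-bit/66 MHz peak: {fast_bus} Mbit/s, above 3 Gbit/s: {fast_bus >= 3000}")
```

Sustained rates on a real bus are lower than these peaks, as the measured 404 Mbit/s for Rijndael shows.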

7. Related Work 



Several research groups developed VHDL implementations of Rijndael in Xilinx 
FPGAs [3, 6, 7, 12, 14], and Altera FPDs [8, 9, 10, 1 15]. A survey and relative 
comparison of results from various groups is given in [13]. All major results 
described in the aforementioned papers are based on the static timing analysis and 
simulation, and have not yet been confirmed experimentally. 

The first attempt to validate the simulation speed of Rijndael through 
experimental testing is described in [9]. The test was performed using a specially 
developed PCI card. Nevertheless, since the operation of the system appeared to be 
limited by the PCI controller, no numerical results of the experimental tests were 
reported in the paper. 

As a result, our paper is the first one that describes the successful experimental 
testing of Rijndael, and directly compares the experimental results with simulation. 



8. Summary and Possible Extensions 

The IPSec-compliant encryption/decryption units of the new Advanced Encryption 
Standard, Rijndael, and the older encryption standard, Triple DES, have been 
developed and tested experimentally. Both units support the Cipher Block Chaining 
mode. Our experiment demonstrated up to 27% differences between the results 
obtained from testing, and results of the static timing analysis using Xilinx tools. 
These differences confirmed that the results based on the static analyzer should be 
treated only as worst-case estimates. 

The experimental procedure demonstrated that the total encryption and 
decryption throughput of Rijndael and Triple DES in excess of 1 Gbit/s can be 
achieved using a single Virtex 1000 FPGA device. Only up to 80% of resources of 
this single FPGA device are required by all cryptographic modules. The throughput in 
excess of 3 Gbit/s can be accomplished by using the two remaining FPGA devices 
present on the SLAAC-1V accelerator board. The alternative extensions include the 
implementation and experimental testing of other security transformations of IPSec, 
such as HMAC and the Internet Key Exchange protocol. 